This is an example notebook demonstrating the jupyter_spark notebook plugin.
It is based on the pi-approximation example in the PySpark documentation. It works by sampling random points in a square and counting how many fall inside the unit circle: the circle has area pi while the enclosing square has area 4, so the fraction of points that land inside approaches pi/4, and pi is roughly 4 * (points inside) / (total points).
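To see the idea before bringing in Spark, here is a minimal pure-Python sketch of the same Monte Carlo method (the 100,000 sample count is an arbitrary choice for illustration):

from random import random

samples = 100000  # small single-machine run, just to illustrate the idea
inside = 0
for _ in range(samples):
    x = random() * 2 - 1  # random point in the square (-1, -1) to (1, 1)
    y = random() * 2 - 1
    if x ** 2 + y ** 2 <= 1:  # point falls inside the unit circle
        inside += 1
print("Pi is roughly %f" % (4.0 * inside / samples))

The Spark version below distributes exactly this loop across workers.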
In [1]:
from random import random
from operator import add
from pyspark.sql import SparkSession
Create a SparkSession and give the application a name.
Note: this starts the Spark session for the notebook; there is no need to run spark-shell directly.
In [2]:
spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .getOrCreate()
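For reference, the builder can also take a master URL and configuration settings when you need more control over where the job runs. A hedged sketch, where the local[*] master and the memory value are illustrative assumptions rather than anything this notebook requires:

spark = (
    SparkSession.builder
    .master("local[*]")  # illustrative: run locally on all available cores
    .appName("PythonPi")
    .config("spark.executor.memory", "1g")  # illustrative setting
    .getOrCreate()
)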
partitions
is the number of partitions to split the sampling work into; Spark schedules these across its workers.
In [3]:
partitions = 2
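Instead of hard-coding the value, you could derive it from Spark's own default parallelism; a sketch (whether this suits your cluster is an assumption):

# use Spark's default level of parallelism
# (typically the total number of cores available)
partitions = spark.sparkContext.defaultParallelism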
n
is the total number of random samples to draw.
In [4]:
n = 100000000
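The estimate converges slowly: the count is binomial with success probability p = pi/4, so the standard error of 4 * count / n is 4 * sqrt(p * (1 - p) / n), meaning each additional decimal digit of accuracy costs roughly 100x more samples. A quick back-of-the-envelope check, pure Python and purely illustrative:

import math

p = math.pi / 4  # probability a random point lands inside the circle
stderr = 4 * math.sqrt(p * (1 - p) / n)  # standard error of the pi estimate
print("expected standard error with n = %d: ~%g" % (n, stderr))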
This is the sampling function. It generates a random point in the square from (-1, -1) to (1, 1) and returns 1 if the point falls inside the unit circle, and 0 otherwise.
In [5]:
def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0
Here's where we farm the work out to Spark: parallelize creates an RDD of n elements split into the requested number of partitions, map applies the sampling function to each element, and reduce(add) sums the results into a count of hits.
In [6]:
count = spark.sparkContext \
    .parallelize(range(1, n + 1), partitions) \
    .map(f) \
    .reduce(add)
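Because f returns only 0 or 1, reduce(add) simply counts the hits; the RDD's built-in sum action is an equivalent way to write the same computation (shown as an alternative, not what the original example uses):

count = spark.sparkContext \
    .parallelize(range(1, n + 1), partitions) \
    .map(f) \
    .sum()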
In [7]:
print("Pi is roughly %f" % (4.0 * count / n))
Shut down the Spark session.
In [8]:
spark.stop()